For much of human existence, child birth has been an extraordinarily risky endeavor, often ending with the death of either the mother or baby. With the advent of modern medicine and technology, much of the world has seen a significant reduction in the rate of child berth related deaths and of child deaths. While this is something to applaud, there is still much work to be done. Child development after berth is a crucial period during which many of the neurological and motor skills necessary for a full and healthy life ossify. Anything parents can do to increase the healthiness of the baby in this period could have substantial effeccdts on a baby’s health. One avenue that might have a positive impact is during the actual pregnancy. Some women engage in prenatal activities that may be risky and others engage in prenatal activities that might be salubrious. In this study, We seek to determine whether prenatal care has an effect on the health of a newborn.
library(car)
library(lmtest)
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
library(sandwich)
library(ggplot2)
Variable Descriptions:
load("bwght_w203.RData")
desc
Data Quality Assessment
# First we want to determine whether there are missing data. If there are, we will then want to determine whether
# there is a pattern to those missing data.
apply(data, 2, function(x) mean(is.na(x)))
mage meduc monpre npvis fage feduc bwght omaps fmaps
0.000000000 0.016375546 0.002729258 0.037117904 0.003275109 0.025655022 0.000000000 0.001637555 0.001637555
cigs drink lbw vlbw male mwhte mblck moth fwhte
0.060043668 0.062772926 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
fblck foth lbwght magesq npvissq
0.000000000 0.000000000 0.000000000 0.000000000 0.037117904
The apply function shows that for many of the variables we are not missing any data, while there are several variables on which we are missing some data. The fact that the average number of drinks per night or cigarrettes per week variable has the most missing data could indicate that many mothers who did consume alcohol during their pregnacy refused to answer. This could potentially lead to nonresponse bias. The omaps and fmaps variables have the exact same number of missing data. If this happens on the same observation, then we might have hospital bias. Some hospitals might not do these measurements. There is also a very high percentage of missing data from the same people on the smoking and drinking quesiton.
Exploratory Analysis of each variable:
We start by looking at the independent variables available to measure level of prenatal care administered to mothers, as well as any variables that might be useful for controlling for other factors.
Age Variables:
hist(data$mage, breaks = 20)
summary(data$fage)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
18.00 28.00 31.00 31.92 35.00 64.00 6
hist(data$fage, breaks = 40)
summary(data$fage)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
18.00 28.00 31.00 31.92 35.00 64.00 6
The father and mother age variables both have fairly normal distributions around the 30-year mark, with the father age variable having a slight positive skew. This should mean that this dataset is fairly representative from an age perspective.
Education variables
hist(data$meduc)
summary(data$meduc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3.00 12.00 13.00 13.72 16.00 17.00 30
hist(data$feduc)
summary(data$feduc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3.00 12.00 14.00 13.92 16.00 17.00 47
cor(data[complete.cases(data$feduc) & complete.cases(data$meduc),]$feduc, data[complete.cases(data$feduc) & complete.cases(data$meduc),]$meduc)
[1] 0.5841373
The mother and father education variables both show signs of graduation effect at the 12 (high school) and 16 (college) mark. Therefore, we can bucket this variable into high school, college, and further education (see Variable Transformation below).
Additionally, there is about 60% correlation between the education of the mother and father of each baby in the dataset, which aligns with the general notion that people marry others of the same education level.
Length of Prenatal Care
hist(data$monpre, breaks = 9)
summary(data$monpre)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 1.000 2.000 2.122 2.000 9.000 5
The monpre variable signifies the month of pregnancy at which prenatal care started, and is one of the key variables we expect to help us predict infant health. Its range from zero to nine is sensible, although we don’t have anything to show us children that were born prematurely or later than nine months. In fact, such cases might be important because babies born prematurely tend to have smaller weight and face serious health risks. Therefore, the lack of this information might influence our conclusions.
The histogram of monpre shows that the data is heavily skewed to the right. There are large peaks at zero and one, meaning that most women start prenatal care early on in their pregnancy.
Number of Prenatal Visits
hist(data$npvis, breaks = 40)
summary(data$npvis)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 10.00 12.00 11.62 13.00 40.00 68
The number of prenatal visits is another key metric we expect will have an effect on infant health. It has a strong positive skew, and we can see a peak at 12 visits. This could be because mothers that took part in this study were be eligible to a certain number of prenatal visits (possibly 12) for free, leading to a lot of women making exactly that many visits. The question of who pays for these visits could have a confounding effect, because it is more likely for mothers to make more visits if they are free and they would probably make less visits if they aren’t.
Plotting the number of visits to the month at which prenatal care started (see below), we can see two possible outliers in the top right. These two mothers had around 30 visits, even though they started their prenatal care in the 8th month. It could be the case that both of these mothers were admitted into hospital for a long period for a specific health problem, but we should check if these two points have an influence on the regression line of our models.
plot(jitter(data$monpre), jitter(data$npvis))
Cigarettes
hist(data$cigs, breaks = 40)
hist(data$cigs[data$cigs > 0], breaks = 40)
nrow(data[data$cigs > 0,])/nrow(data)
[1] 0.1402838
summary(data$cigs)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 0.000 0.000 1.089 0.000 40.000 110
The cigarettes variable has a peak at zero and a strong positive skew. Only about 14% of the data is of actual smokers, so we could get some skewed results. The distribution is very dispersed, with peaks at the 10, 20, 30, and 40 mark. This shows that people tended to round to these values, perhaps because they didn’t know the exact number. It could also be the case that people tend do round down such values. This could lead to loss of important detail in the data, but the variable could still be used as an indicator of whether people are heavy, light or non-smokers in general (see variable transformation below).
Drinking
hist(data$drink)
hist(data$drink[data$drink > 0], breaks = 40)
summary(data$drink)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.0198 0.0000 8.0000 115
In this dataset, there are only about 16 people that actually drank any alcohol during their pregnancy, and they make up only about 1% of the data. While it makes sense that respondents would stop drinking (or at least say they did) during their pregnancy, such a small amount of actual drinkers means this variable will not be helpful in explaining variances in infant health.
Baby Gender
table(data$male)
0 1
891 941
There are approximately the same amounts of male babies as female babies. This might be a useful control variable, in the case when one gender is heavier in general than the other.
summary(data[data$male ==1,]$bwght)
Min. 1st Qu. Median Mean 3rd Qu. Max.
360 3090 3459 3441 3820 5160
summary(data[data$male ==0,]$bwght)
Min. 1st Qu. Median Mean 3rd Qu. Max.
506 3040 3380 3359 3718 5204
Race
nrow(data[data$mwhte == 1,])
[1] 1624
nrow(data[data$mblck == 1,])
[1] 109
nrow(data[data$moth == 1,])
[1] 99
nrow(data[data$fwhte == 1,])
[1] 1630
nrow(data[data$fblck == 1,])
[1] 107
nrow(data[data$foth == 1,])
[1] 95
hist(data$mwhte)
summary(data$mwhte)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 1.0000 1.0000 0.8865 1.0000 1.0000
This study covers 1624 white, 109 black and 99 women of other races. The results for men is similar. While we don’t expect race to have any effect on infant health, it is worth noting that the data is very skewed.
We now look at the prospective dependent variables we might use to measure infant health.
Birth Weight
hist(data$bwght, breaks = 50)
summary(data$bwght)
Min. 1st Qu. Median Mean 3rd Qu. Max.
360 3076 3425 3401 3770 5204
The birth weight variable is fairly normally distributied, even thought it has a small negative skew. While it doesn’t require any transformation, we can take the log10 of birth weight, since a percent change in weight is something that makes intuitive sense. This is a very good candidate for our dependent variable, because it is a continuous quantitative feature that is generally considered a good indicator of overall baby health.
One- and Five- Minute APGAR Scores
summary(data$omaps)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 8.000 9.000 8.386 9.000 10.000 3
summary(data$fmaps)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2.000 9.000 9.000 9.004 9.000 10.000 3
The one- and five- minute APGAR scores are a quick and easy scale that doctors can use to judge how healthy a newly-born infant looks one minute and five minute after it is born. It is generally used as a measure of the short-term health of the baby, helping doctors decide whether a baby will need immediate medical assistance. It is calculated by the doctor judging a baby on a scale of 0 to 2 on five different metrics (activity, pulse, grimace, appearance, respiration).
Therefore, this means that the APGAR scores are essentially an ordinal variable. The scores in the dataset are mostly larger than 8, which means that we don’t have a lot of lower results to help us determine the effect of prenatal care in all cases. That is why we don’t think the APGAR scores are a good measure for overall baby health, and we will not use them in our regression.
Variable transformation
data$npvismonth <- data$npvis/(10-data$monpre)
data$nonsmoker <- ifelse(data$cigs == 0, 1, 0)
data$smoker <- ifelse(data$cigs > 0, 1, 0)
data$lightsmoker <- ifelse(data$cigs > 0 & data$cigs <= 10, 1, 0)
data$occsmoker <- ifelse(data$cigs > 10 & data$cigs <= 20, 1, 0)
data$heavysmoker <- ifelse(data$cigs > 20, 1, 0)
data$drinker <- ifelse(data$drink >= 1, 1, 0)
data$mheduc <- ifelse(data$meduc <= 12, 1, 0)
data$fheduc <- ifelse(data$feduc <= 12, 1, 0)
data$mceduc <- ifelse(data$meduc > 12 & data$meduc <=16, 1, 0)
data$fceduc <- ifelse(data$feduc > 12 & data$feduc <=16, 1, 0)
data$mfureduc <- ifelse(data$meduc > 16, 1, 0)
data$ffureduc <- ifelse(data$feduc > 16, 1, 0)
Different model testing
# Best model
model8 <- lm(bwght ~ npvis + monpre + lightsmoker + occsmoker + heavysmoker + drink + male + fage + fwhte + fblck + foth + mwhte + mblck + moth + mheduc + mceduc, data=data)
model6 <- lm(bwght ~ npvis + monpre + lightsmoker + occsmoker + heavysmoker + drink + male + fage + fwhte + fblck + foth + mwhte + mblck + moth, data=data)
model1 <- lm(bwght ~ npvis + monpre + lightsmoker + occsmoker + heavysmoker + drink + male + fage, data=data)
model4 <- lm(lbwght ~ npvis + monpre + lightsmoker + occsmoker + heavysmoker + drink + male + fage, data=data)
model5 <- lm(bwght ~ npvis + monpre + lightsmoker + occsmoker + heavysmoker + drink + male + fage + fwhte + fblck + foth, data=data)
# Trying to use a drinker indicator variable
model2 <- lm(bwght ~ npvis + monpre + lightsmoker + occsmoker + heavysmoker + drinker + male + fage, data=data)
# Using cigs instead of indicator variable
model3 <- lm(bwght ~ npvis + monpre + cigs + male + fage, data=data)
# With almost every variable
model7 <- lm(bwght ~ npvis + monpre + lightsmoker + occsmoker + heavysmoker + drink + male + fage + fwhte + fblck + foth + mwhte + mblck + moth + feduc, data=data)
# Stan model suggestions: should we use npvissq? It could be that number of visits has an effect that changes with the number of visits (parabolic relationship)
models1 = lm(bwght ~ monpre + npvis, data = data)
models2 = lm(bwght ~ monpre + npvis + npvissq, data = data)
AIC(models1)
[1] 27428.8
AIC(models2)
[1] 27421.13
summary(models1)
Call:
lm(formula = bwght ~ monpre + npvis, data = data)
Residuals:
Min 1Q Median 3Q Max
-2906.08 -326.21 21.36 359.01 1823.72
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3161.271 59.917 52.761 < 2e-16 ***
monpre 17.062 11.727 1.455 0.146
npvis 17.549 3.925 4.471 8.28e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 577.5 on 1760 degrees of freedom
(69 observations deleted due to missingness)
Multiple R-squared: 0.01124, Adjusted R-squared: 0.01011
F-statistic: 9.999 on 2 and 1760 DF, p-value: 4.806e-05
summary(models2)
Call:
lm(formula = bwght ~ monpre + npvis + npvissq, data = data)
Residuals:
Min 1Q Median 3Q Max
-2810.82 -326.02 20.53 353.18 1848.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2916.1545 98.8982 29.486 < 2e-16 ***
monpre 29.0077 12.3118 2.356 0.0186 *
npvis 50.7033 11.3540 4.466 8.49e-06 ***
npvissq -1.1144 0.3582 -3.111 0.0019 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 576.1 on 1759 degrees of freedom
(69 observations deleted due to missingness)
Multiple R-squared: 0.01665, Adjusted R-squared: 0.01497
F-statistic: 9.925 on 3 and 1759 DF, p-value: 1.729e-06
coeftest(models1, vcov = vcovHC)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3161.2707 74.6049 42.3735 < 2.2e-16 ***
monpre 17.0622 12.0277 1.4186 0.1561984
npvis 17.5494 4.8342 3.6302 0.0002913 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
coeftest(models2, vcov = vcovHC)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2916.15453 139.92101 20.8414 < 2.2e-16 ***
monpre 29.00774 13.45221 2.1564 0.031191 *
npvis 50.70326 16.08255 3.1527 0.001645 **
npvissq -1.11441 0.51242 -2.1748 0.029778 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
CLM Assumptions
Regression Table
Causality discussion
Bivariate Plots
ggplot(data = data, aes(x = data$monpre, y = data$bwght)) + geom_point() + geom_smooth(method = "loess") # This strongly suggests that the month at which you start prenatal care has no impact on the birthweight of the baby.
ggplot(data = data, aes(x = data$npvis, y = data$bwght)) + geom_point() + geom_smooth(method = "loess") # serious outliers towards the lower left of the plot, but these will be outweighed by the many other variables whose x values are the same. There are scarly any data beyond 20 npvis. There seems to be a small positive effect to having over 5 visits.
Correlation Matrix
cor_test <- as.data.frame(cor(data, use = "complete.obs"))
cor_test
Some high correlations: Mage and meduc = 0.33 mage and fage = 0.689 feduc and meduc = 0.59 magesq & mage = 0.99 magesq & meduc 0.319 npvissq & npvis = 0.93
mod_key <- lm(bwght ~ monpre + npvis, data = data)
summary(mod_key)
Call:
lm(formula = bwght ~ monpre + npvis, data = data)
Residuals:
Min 1Q Median 3Q Max
-2906.08 -326.21 21.36 359.01 1823.72
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3161.271 59.917 52.761 < 2e-16 ***
monpre 17.062 11.727 1.455 0.146
npvis 17.549 3.925 4.471 8.28e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 577.5 on 1760 degrees of freedom
(69 observations deleted due to missingness)
Multiple R-squared: 0.01124, Adjusted R-squared: 0.01011
F-statistic: 9.999 on 2 and 1760 DF, p-value: 4.806e-05
plot(mod_key)
mod_cov <- lm(bwght ~ monpre + npvis + cigs + meduc + mage, data = data)
summary(mod_cov)
Call:
lm(formula = bwght ~ monpre + npvis + cigs + meduc + mage, data = data)
Residuals:
Min 1Q Median 3Q Max
-2959.75 -329.35 13.13 347.54 1779.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3060.656 131.292 23.312 < 2e-16 ***
monpre 21.546 12.073 1.785 0.074509 .
npvis 13.495 3.937 3.428 0.000624 ***
cigs -10.960 3.346 -3.275 0.001078 **
meduc 2.171 7.187 0.302 0.762663
mage 4.365 3.157 1.382 0.167060
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 567 on 1637 degrees of freedom
(189 observations deleted due to missingness)
Multiple R-squared: 0.01592, Adjusted R-squared: 0.01292
F-statistic: 5.298 on 5 and 1637 DF, p-value: 7.796e-05
plot(mod_cov)
mod_cov_minus <- lm(bwght ~ monpre + npvis + cigs + feduc + fage + omaps + male, data = data)
summary(mod_cov_minus)
Call:
lm(formula = bwght ~ monpre + npvis + cigs + feduc + fage + omaps +
male, data = data)
Residuals:
Min 1Q Median 3Q Max
-2468.7 -338.7 15.3 348.9 1818.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2332.465 159.708 14.605 < 2e-16 ***
monpre 23.302 11.942 1.951 0.051194 .
npvis 11.510 3.884 2.963 0.003088 **
cigs -8.198 3.332 -2.460 0.013987 *
feduc 4.666 6.335 0.736 0.461536
fage 5.233 2.496 2.096 0.036219 *
omaps 74.753 12.560 5.951 3.25e-09 ***
male 92.607 27.375 3.383 0.000734 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 550.5 on 1614 degrees of freedom
(210 observations deleted due to missingness)
Multiple R-squared: 0.04362, Adjusted R-squared: 0.03947
F-statistic: 10.52 on 7 and 1614 DF, p-value: 5.565e-13
plot(mod_cov_minus)
install.packages("leaps")
library(leaps)
str(data)
regsubsets_out <- regsubsets(bwght ~ mage + meduc + monpre + npvis + fage + feduc + cigs + drink + male + mwhte + mblck + fwhte + fblck + npvissq, data = data, nbest = 1, nvmax = NULL, force.in = NULL, force.out = NULL, method = "exhaustive")
regsubsets_out
summary_out <- summary(regsubsets_out)
as.data.frame(summary_out$outmat)
plot(regsubsets_out, scale = "adjr2", main = "Adjusted R^2")
======= Conclusion >>>>>>> 22095f252bf00e33d2b896c4ef9ca6615deb1119